{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using numerical and categorical variables together\n",
    "\n",
    "In the previous notebooks, we showed the required preprocessing to apply when\n",
    "dealing with numerical and categorical variables. However, we decoupled the\n",
    "process to treat each type individually. In this notebook, we show how to\n",
    "combine these preprocessing steps.\n",
    "\n",
    "We first load the entire adult census dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n",
    "# drop the duplicated column `\"education-num\"` as stated in the first notebook\n",
    "adult_census = adult_census.drop(columns=\"education-num\")\n",
    "\n",
    "target_name = \"class\"\n",
    "target = adult_census[target_name]\n",
    "\n",
    "data = adult_census.drop(columns=[target_name])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selection based on data types\n",
    "\n",
    "We separate categorical and numerical variables using their data types to\n",
    "identify them, as we saw previously that `object` corresponds to categorical\n",
    "columns (strings). We make use of `make_column_selector` helper to select the\n",
    "corresponding columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.compose import make_column_selector as selector\n",
    "\n",
    "numerical_columns_selector = selector(dtype_exclude=object)\n",
    "categorical_columns_selector = selector(dtype_include=object)\n",
    "\n",
    "numerical_columns = numerical_columns_selector(data)\n",
    "categorical_columns = categorical_columns_selector(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition caution alert alert-warning\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
    "<p>Here, we know that <tt class=\"docutils literal\">object</tt> data type is used to represent strings and thus\n",
    "categorical features. Be aware that this is not always the case. Sometimes\n",
    "<tt class=\"docutils literal\">object</tt> data type could contain other types of information, such as dates that\n",
    "were not properly formatted (strings) and yet relate to a quantity of\n",
    "elapsed  time.</p>\n",
    "<p class=\"last\">In a more general scenario you should manually introspect the content of your\n",
    "dataframe not to wrongly use <tt class=\"docutils literal\">make_column_selector</tt>.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dispatch columns to a specific processor\n",
    "\n",
    "In the previous sections, we saw that we need to treat data differently\n",
    "depending on their nature (i.e. numerical or categorical).\n",
    "\n",
    "Scikit-learn provides a `ColumnTransformer` class which sends specific\n",
    "columns to a specific transformer, making it easy to fit a single predictive\n",
    "model on a dataset that combines both kinds of variables together\n",
    "(heterogeneously typed tabular data).\n",
    "\n",
    "We first define the columns depending on their data type:\n",
    "\n",
    "* **one-hot encoding** is applied to categorical columns. Besides, we use\n",
    "  `handle_unknown=\"ignore\"` to solve the potential issues due to rare\n",
    "  categories.\n",
    "* **numerical scaling** numerical features which will be standardized.\n",
    "\n",
    "Now, we create our `ColumnTransfomer` using the helper function\n",
    "`make_column_transformer`. We specify two values: the transformer, and the\n",
    "columns. First, let's create the preprocessors for the numerical and\n",
    "categorical parts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
    "\n",
    "categorical_preprocessor = OneHotEncoder(handle_unknown=\"ignore\")\n",
    "numerical_preprocessor = StandardScaler()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we create the transformer and associate each of these preprocessors with\n",
    "their respective columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.compose import make_column_transformer\n",
    "\n",
    "preprocessor = make_column_transformer(\n",
    "    (categorical_preprocessor, categorical_columns),\n",
    "    (numerical_preprocessor, numerical_columns),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can take a minute to represent graphically the structure of a\n",
    "`ColumnTransformer`:\n",
    "\n",
    "![columntransformer diagram](../figures/api_diagram-columntransformer.svg)\n",
    "\n",
    "A `ColumnTransformer` does the following:\n",
    "\n",
    "* It **splits the columns** of the original dataset based on the column names\n",
    "  or indices provided. We obtain as many subsets as the number of transformers\n",
    "  passed into the `ColumnTransformer`.\n",
    "* It **transforms each subsets**. A specific transformer is applied to each\n",
    "  subset: it internally calls `fit_transform` or `transform`. The output of\n",
    "  this step is a set of transformed datasets.\n",
    "* It then **concatenates the transformed datasets** into a single dataset.\n",
    "\n",
    "The important thing is that `ColumnTransformer` is like any other scikit-learn\n",
    "transformer. In particular it can be combined with a classifier in a\n",
    "`Pipeline`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))\n",
    "model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The final model is more complex than the previous models but still follows the\n",
    "same API (the same set of methods that can be called by the user):\n",
    "\n",
    "- the `fit` method is called to preprocess the data and then train the\n",
    "  classifier of the preprocessed data;\n",
    "- the `predict` method makes predictions on new data;\n",
    "- the `score` method is used to predict on the test data and compare the\n",
    "  predictions to the expected test labels to compute the accuracy.\n",
    "\n",
    "Let's start by splitting our data into train and test sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "data_train, data_test, target_train, target_test = train_test_split(\n",
    "    data, target, random_state=42\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "<div class=\"admonition caution alert alert-warning\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
    "<p class=\"last\">Be aware that we use <tt class=\"docutils literal\">train_test_split</tt> here for didactic purposes, to show\n",
    "the scikit-learn API. In a real setting one might prefer to use\n",
    "cross-validation to also be able to evaluate the uncertainty of our estimation\n",
    "of the generalization performance of a model, as previously demonstrated.</p>\n",
    "</div>\n",
    "\n",
    "Now, we can train the model on the train set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = model.fit(data_train, target_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we can send the raw dataset straight to the pipeline. Indeed, we do not\n",
    "need to make any manual preprocessing (calling the `transform` or\n",
    "`fit_transform` methods) as it is already handled when calling the `predict`\n",
    "method. As an example, we predict on the five first samples from the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.predict(data_test)[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_test[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get directly the accuracy score, we need to call the `score` method. Let's\n",
    "compute the accuracy score on the entire test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.score(data_test, target_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation of the model with cross-validation\n",
    "\n",
    "As previously stated, a predictive model should be evaluated by\n",
    "cross-validation. Our model is usable with the cross-validation tools of\n",
    "scikit-learn as any other predictors:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_validate\n",
    "\n",
    "cv_results = cross_validate(model, data, target, cv=5)\n",
    "cv_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = cv_results[\"test_score\"]\n",
    "print(\n",
    "    \"The mean cross-validation accuracy is: \"\n",
    "    f\"{scores.mean():.3f} \u00b1 {scores.std():.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The compound model has a higher predictive accuracy than the two models that\n",
    "used numerical and categorical variables in isolation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fitting a more powerful model\n",
    "\n",
    "**Linear models** are nice because they are usually cheap to train, **small**\n",
    "to deploy, **fast** to predict and give a **good baseline**.\n",
    "\n",
    "However, it is often useful to check whether more complex models such as an\n",
    "ensemble of decision trees can lead to higher predictive performance. In this\n",
    "section we use such a model called **gradient-boosting trees** and evaluate\n",
    "its generalization performance. More precisely, the scikit-learn model we use\n",
    "is called `HistGradientBoostingClassifier`. Note that boosting models will be\n",
    "covered in more detail in a future module.\n",
    "\n",
    "For tree-based models, the handling of numerical and categorical variables is\n",
    "simpler than for linear models:\n",
    "* we do **not need to scale the numerical features**\n",
    "* using an **ordinal encoding for the categorical variables** is fine even if\n",
    "  the encoding results in an arbitrary ordering\n",
    "\n",
    "Therefore, for `HistGradientBoostingClassifier`, the preprocessing pipeline is\n",
    "slightly simpler than the one we saw earlier for the `LogisticRegression`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import HistGradientBoostingClassifier\n",
    "from sklearn.preprocessing import OrdinalEncoder\n",
    "\n",
    "categorical_preprocessor = OrdinalEncoder(\n",
    "    handle_unknown=\"use_encoded_value\", unknown_value=-1\n",
    ")\n",
    "\n",
    "preprocessor = make_column_transformer(\n",
    "    (categorical_preprocessor, categorical_columns),\n",
    "    remainder=\"passthrough\",\n",
    ")\n",
    "\n",
    "model = make_pipeline(preprocessor, HistGradientBoostingClassifier())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we created our model, we can check its generalization performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "_ = model.fit(data_train, target_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.score(data_test, target_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can observe that we get significantly higher accuracies with the Gradient\n",
    "Boosting model. This is often what we observe whenever the dataset has a large\n",
    "number of samples and limited number of informative features (e.g. less than\n",
    "1000) with a mix of numerical and categorical variables.\n",
    "\n",
    "This explains why Gradient Boosted Machines are very popular among datascience\n",
    "practitioners who work with tabular data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we:\n",
    "\n",
    "* used a `ColumnTransformer` to apply different preprocessing for categorical\n",
    "  and numerical variables;\n",
    "* used a pipeline to chain the `ColumnTransformer` preprocessing and logistic\n",
    "  regression fitting;\n",
    "* saw that **gradient boosting methods** can outperform **linear models**."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}